Is it really, um, revealing?

Josef Fruehwald

February 2, 2015

Intro

Two parts of this talk

  1. Increasing “um”: a language change in progress
  2. How informative is it?

The Data

UhUm Package

The raw data from the Philadelphia Neighborhood Corpus is available here:

  library(devtools)
  install_github("jofrhwld/UhUm")

UhUm Package

  library(UhUm)
  head(um_PNC, 3)
##    idstring word start_time end_time vowel_start vowel_end nasal_start
## 1 PH00-1-1-   UH      24.39    24.69       24.39     24.69          NA
## 2 PH00-1-1-   UH      34.96    35.24       34.96     35.24          NA
## 3 PH00-1-1-   UM      37.90    38.27       37.90     38.12       38.12
##   nasal_end next_seg next_seg_start next_seg_end chunk_start chunk_end
## 1        NA        S          24.69        24.87       24.39     25.29
## 2        NA        F          35.24        35.35       34.96     37.11
## 3     38.27       sp          38.27        38.39       37.90     38.80
##   nwords sex year age ethnicity schooling transcribed total nvowels
## 1   6551   m 2000  21       i/r        14        2811  2814    3078
## 2   6551   m 2000  21       i/r        14        2811  2814    3078
## 3   6551   m 2000  21       i/r        14        2811  2814    3078

um_PNC

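  library(dplyr)   # %>%, group_by(), summarise(), ungroup()
  library(tidyr)   # spread()
  # Count tokens of each filled-pause type by speaker sex
  um_PNC %>%
    group_by(word, sex) %>%
    summarise(n = n()) %>%
    ungroup() %>%
    spread(sex, n)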
## Source: local data frame [5 x 3]
## 
##     word    f    m
## 1 AND_UH  904 1176
## 2 AND_UM  314  153
## 3     UH 7523 9520
## 4     UM 4132 1792
## 5  UM_UH    7    2
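
As a quick check on the table above, the share of UM among the plain UH/UM tokens can be worked out for each sex. This is only a sketch: it retypes the four counts from the printed output rather than recomputing them from `um_PNC`, it leaves out the compound forms (AND_UH, AND_UM, UM_UH), and the object names are illustrative.

```r
# Counts copied from the UH and UM rows of the table above.
counts <- matrix(
  c(7523, 4132,   # f: UH, UM
    9520, 1792),  # m: UH, UM
  nrow = 2,
  dimnames = list(word = c("UH", "UM"), sex = c("f", "m"))
)

# Proportion of UM within each sex: UM / (UH + UM)
um_share <- counts["UM", ] / colSums(counts)
round(um_share, 3)
```

Pooled over all speakers and years, that is roughly 35% UM for women and 16% for men.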

Transcription

From the FAVE transcription guidelines:

trans

The Change

Um Increasing

plot of chunk unnamed-chunk-7

Um Increasing

plot of chunk unnamed-chunk-8

Filled Pauses

Flat (or decreasing?)

plot of chunk unnamed-chunk-9

Cohorts

plot of chunk unnamed-chunk-10

Filled Pause Rate

Cohorts

plot of chunk unnamed-chunk-11

Conditioning Factors

plot of chunk unnamed-chunk-12

Conditioning Factors

plot of chunk unnamed-chunk-13

Constant Rate Effect Modelling

Just female data:
                 Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)         -0.27        0.09    -2.90      0.00
fol_segC            -2.03        0.09   -23.02      0.00
fol_segV            -1.77        0.13   -13.60      0.00
decade               0.50        0.04    12.78      0.00
fol_segC:decade      0.23        0.04     5.54      0.00
fol_segV:decade      0.20        0.06     3.18      0.00
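
One way to read the interaction rows: under treatment coding (an assumption, the slides do not state the contrast scheme or name the reference level of `fol_seg`), the slope of `decade` in each context is the main effect plus that context's interaction term. A small sketch using only the estimates printed above:

```r
# Fixed-effect estimates copied from the table above (log-odds scale).
b <- c(decade = 0.50, "fol_segC:decade" = 0.23, "fol_segV:decade" = 0.20)

# Slope of decade in each following-segment context.
slopes <- c(
  reference = b[["decade"]],
  C         = b[["decade"]] + b[["fol_segC:decade"]],
  V         = b[["decade"]] + b[["fol_segV:decade"]]
)
slopes

# The same slopes expressed as odds ratios per unit of decade.
round(exp(slopes), 2)
```

The slopes come out as 0.50, 0.73 and 0.70, i.e. odds ratios of about 1.65, 2.08 and 2.01 per unit of `decade`.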

CRE Modelling

          Df   AIC   BIC  logLik  deviance  Chisq  Chi Df  Pr(>Chisq)
cre_mod2   5  9893  9929   -4941      9883     NA      NA          NA
cre_mod    7  9856  9907   -4921      9842  40.75       2           0
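
The table above reads as a likelihood-ratio test between two nested models that differ by two parameters, presumably the two `fol_seg:decade` interaction terms. The sketch below shows the same kind of comparison on simulated data. Everything in it is an assumption for illustration: the data are invented, the level names `P`, `C`, `V` are hypothetical, and it uses plain `glm()` rather than the mixed-effects models (with a by-speaker intercept) that the slides appear to use.

```r
set.seed(1)
n <- 3000

# Simulated tokens: a time predictor and a three-level context factor.
sim <- data.frame(
  decade  = runif(n, -3, 3),
  fol_seg = factor(sample(c("P", "C", "V"), n, replace = TRUE),
                   levels = c("P", "C", "V"))
)

# Generate data under a constant rate: one slope shared by all contexts.
shift <- c(P = 0, C = -2, V = -1.8)
eta <- -0.3 + 0.5 * sim$decade + unname(shift[as.character(sim$fol_seg)])
sim$is_um <- rbinom(n, 1, plogis(eta))

# Constant-rate model (no interaction) vs. model with context-specific slopes.
m_cre <- glm(is_um ~ fol_seg + decade, data = sim, family = binomial)
m_int <- glm(is_um ~ fol_seg * decade, data = sim, family = binomial)

# Likelihood-ratio test; the models differ by 2 parameters.
lrt <- anova(m_cre, m_int, test = "Chisq")
lrt
```

A small p-value in such a test favours context-specific slopes over a single shared rate of change.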

Conditioning Factors

plot of chunk unnamed-chunk-16

Following Pause Duration

Following Pause

plot of chunk unnamed-chunk-17

Following Pause

plot of chunk unnamed-chunk-18

Following Pause

plot of chunk unnamed-chunk-19

Following Pause

## Warning: Removed 19 rows containing missing values (stat_smooth).
## Warning: Removed 53 rows containing missing values (stat_smooth).

plot of chunk unnamed-chunk-20

Following Pause

                Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)        -0.72        0.10    -7.56      0.00
decade              0.59        0.04    14.00      0.00
log2dur             0.30        0.01    20.56      0.00
decade:log2dur     -0.03        0.01    -4.71      0.00
## is_um ~ decade * log2dur + (1 | idstring)

plot of chunk unnamed-chunk-23
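
To get a feel for the fitted model `is_um ~ decade * log2dur + (1 | idstring)`, the fixed effects can be pushed back through the inverse logit. This is a rough sketch under stated assumptions: it sets the speaker random intercept to 0, and because the slides do not say how `decade` and `log2dur` were scaled or centred, the input values in the grid are hypothetical.

```r
# Fixed effects copied from the coefficient table on this slide.
b0 <- -0.72; b_decade <- 0.59; b_dur <- 0.30; b_int <- -0.03

# Predicted P(um) for a speaker whose random intercept is 0.
p_um <- function(decade, log2dur) {
  plogis(b0 + b_decade * decade + b_dur * log2dur + b_int * decade * log2dur)
}

# Hypothetical grid of predictor values, for illustration only.
grid <- expand.grid(decade = c(-1, 0, 1), log2dur = c(0, 5, 10))
grid$p_um <- round(p_um(grid$decade, grid$log2dur), 2)
grid
```

The negative interaction means the effect of pause duration shrinks slightly as `decade` increases.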

Variation And Change

This would be weird:

\[\text{fp}\rightarrow\text{ə}\left<\begin{array}{c}\text{m}\\\emptyset\end{array}\right>\]

Variation

grammar

Variation

Persistence, from Tamminga (2014): [figure: persistence]

The Media

ukip

Conditional Probabilities

Linguists have written about:

  • \(p =P(\text{um}~|~\text{gender})\)

Someone else has written about(?):

  • \(q = P(\text{gender}~|~\text{ukip})\)

Conditional Probabilities

\(P(\text{ukip}~|~\text{um}) = \mathcal{M}(p,q) \approx 1\)

  • \(\mathcal{M}(p,q)=\) the media bollocks function

Informativeness

Is this a signal?

plot of chunk unnamed-chunk-24

Mutual information

Hang 1 lamp if the British are coming by land, 2 if by sea.

  • \(P(\text{1 lamp}~|~\text{land})=1\)
  • \(P(\text{2 lamps}~|~\text{sea})=1\)

Mutual information

The amount of information to be communicated depends on how likely the different outcomes are:

by_land  by_sea  entropy
    0.1     0.9     0.47
    0.2     0.8     0.72
    0.5     0.5     1.00
    0.8     0.2     0.72
    0.9     0.1     0.47
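
The entropy column can be reproduced with Shannon entropy in bits, \(H(p) = -\sum_i p_i \log_2 p_i\). The `entropy()` helper below is a minimal sketch written for this purpose; the slides do not show their own definition, so treat it as an assumption that happens to match the printed values.

```r
# Shannon entropy in bits; zero-probability outcomes contribute nothing.
entropy <- function(p) {
  p <- p[p > 0]
  -sum(p * log2(p))
}

by_land <- c(0.1, 0.2, 0.5, 0.8, 0.9)
H <- sapply(by_land, function(p) entropy(c(p, 1 - p)))

data.frame(by_land = by_land, by_sea = 1 - by_land, entropy = round(H, 2))
```

Uncertainty peaks at 1 bit when both outcomes are equally likely and falls off symmetrically on either side.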

Mutual Information

The quality of the signal depends on how strictly it covaries with the message:

  • \(P(\text{1 lamp}~|~\text{land})=0.8\)
  • \(P(\text{2 lamps}~|~\text{sea})=0.8\)
  • \(P(\text{land}) = 0.8\)

The Joint Distribution

         by land  by sea  margin
1 lamp      0.64    0.04    0.68
2 lamps     0.16    0.16    0.32
margin      0.80    0.20    1.00
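
Each cell of the joint table is \(P(\text{signal}~|~\text{message}) \times P(\text{message})\). A short sketch that rebuilds the table from the three probabilities on the previous slide (object names are illustrative):

```r
# Prior over messages.
p_message <- c(land = 0.8, sea = 0.2)

# P(signal | message): each column is one message and sums to 1.
p_signal_given_message <- matrix(
  c(0.8, 0.2,   # land: 1 lamp, 2 lamps
    0.2, 0.8),  # sea:  1 lamp, 2 lamps
  nrow = 2,
  dimnames = list(signal = c("1 lamp", "2 lamps"), message = c("land", "sea"))
)

# Joint distribution: scale each column by the probability of its message.
joint <- sweep(p_signal_given_message, 2, p_message, `*`)
joint

rowSums(joint)  # signal margin
colSums(joint)  # message margin
```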

Mutual information

The Mutual Information between message and signal:

  entropy(c(0.2, 0.8)) +      # message uncertainty
  entropy(c(0.68, 0.32)) -    # signal uncertainty
  entropy(c(0.64, 0.04,       # joint uncertainty
            0.16, 0.16))
## [1] 0.1825
  # bits that could've been communicated
  # with a perfect signal
  entropy(c(0.2, 0.8)) 
## [1] 0.7219
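
The calculation above is \(I(M;S) = H(M) + H(S) - H(M,S)\). It can be wrapped into a function that takes the joint table directly. This is a sketch: `entropy()` and `mutual_information()` are defined here as assumptions, since the slides do not show their helper.

```r
# Shannon entropy in bits.
entropy <- function(p) {
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Mutual information (bits) from a joint table of probabilities or counts.
mutual_information <- function(joint) {
  joint <- joint / sum(joint)
  entropy(rowSums(joint)) + entropy(colSums(joint)) - entropy(joint)
}

# Noisy lamp signal from the joint distribution slide.
noisy <- matrix(c(0.64, 0.16, 0.04, 0.16), nrow = 2,
                dimnames = list(signal = c("1 lamp", "2 lamps"),
                                message = c("land", "sea")))

# Perfect signal: the lamps always match the message.
perfect <- matrix(c(0.8, 0, 0, 0.2), nrow = 2)

mi_noisy   <- mutual_information(noisy)
mi_perfect <- mutual_information(perfect)
round(c(noisy = mi_noisy, perfect = mi_perfect), 4)

# Share of the message uncertainty that the noisy signal removes.
mi_noisy / mi_perfect
```

With a perfect signal the mutual information equals the message entropy (0.7219 bits); the noisy signal carries 0.1825 bits, about a quarter of that.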

Informativeness?

plot of chunk unnamed-chunk-30

Informativeness?

## Warning: NAs introduced by coercion

plot of chunk unnamed-chunk-32

Informativeness

Need to compare this to some other kind of signal.

  library("babynames")
  head(babynames, 3)
## Source: local data frame [3 x 5]
## 
##   year sex name    n    prop
## 1 1880   F Mary 7065 0.07238
## 2 1880   F Anna 2604 0.02668
## 3 1880   F Emma 2003 0.02052
  tail(babynames, 3)
## Source: local data frame [3 x 5]
## 
##   year sex   name n      prop
## 1 2013   M Zymari 5 2.499e-06
## 2 2013   M Zymeer 5 2.499e-06
## 3 2013   M  Zyree 5 2.499e-06
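
One possible way to put the two signals on the same scale is to compute mutual information with sex from a table of counts. The sketch below is not necessarily the measure behind the plots in this deck. It assumes the pooled UH/UM counts printed earlier, ignores speaker and year, picks 2013 arbitrarily for the names, and only runs the names part if the `babynames` package is installed. The helper functions are illustrative.

```r
# Shannon entropy in bits.
entropy <- function(p) {
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Mutual information (bits) from a two-way table of counts.
mutual_information <- function(counts) {
  joint <- counts / sum(counts)
  entropy(rowSums(joint)) + entropy(colSums(joint)) - entropy(joint)
}

# Filled pause vs. sex, counts copied from the um_PNC summary table.
um_counts <- matrix(c(7523, 4132, 9520, 1792), nrow = 2,
                    dimnames = list(word = c("UH", "UM"), sex = c("f", "m")))
mi_um <- mutual_information(um_counts)
mi_um

# Name vs. sex for a single birth year.
if (requireNamespace("babynames", quietly = TRUE)) {
  one_year    <- subset(babynames::babynames, year == 2013)
  name_counts <- xtabs(n ~ name + sex, data = one_year)
  mi_name     <- mutual_information(name_counts)
  print(c(um = mi_um, name = mi_name))
}
```

Since sex has two values here, both quantities are bounded above by 1 bit, which makes them directly comparable.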

Comparative Informativeness

plot of chunk unnamed-chunk-35

Bayesian Update

One-off update:

plot of chunk unnamed-chunk-36
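
A one-off update of this kind is a single application of Bayes' rule: \(P(\text{sex}~|~\text{um}) \propto P(\text{um}~|~\text{sex})\,P(\text{sex})\). The sketch below is an illustration, not a reconstruction of the plot. It assumes a 50/50 prior and takes the likelihoods from the pooled UH/UM counts printed earlier in the deck, ignoring year and speaker.

```r
# P(um | sex), from the pooled UH and UM counts in the um_PNC summary table.
p_um_given <- c(f = 4132 / (4132 + 7523), m = 1792 / (1792 + 9520))

# Assumed prior over the speaker's sex.
prior <- c(f = 0.5, m = 0.5)

# Bayes' rule: multiply prior by likelihood, then renormalise.
bayes_update <- function(prior, likelihood) {
  unnormalised <- prior * likelihood
  unnormalised / sum(unnormalised)
}

posterior_um <- bayes_update(prior, p_um_given)      # after hearing one "um"
posterior_uh <- bayes_update(prior, 1 - p_um_given)  # after hearing one "uh"

round(rbind(um = posterior_um, uh = posterior_uh), 2)
```

Under these assumptions a single "um" moves the probability of a female speaker from 0.50 to about 0.69, and a single "uh" moves it to about 0.43.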